2023 Journal article Open Access
Early exit strategies for learning-to-rank cascades
Busolin F., Lucchese C., Nardini F. M., Orlando S., Perego R., Trani S.
The ranking pipelines of modern search platforms commonly exploit complex machine-learned models and have a significant impact on the query response time. In this paper, we discuss several techniques to speed up the document scoring process based on large ensembles of decision trees without hindering ranking quality. Specifically, we study the problem of document early exit within the framework of a cascading ranker made of three components: 1) an efficient but sub-optimal ranking stage; 2) a pruner that exploits signals from the previous component to force the early exit of documents classified as not relevant; and 3) a final high-quality component aimed at finely ranking the documents that survived the previous phase. To maximize speedup and preserve effectiveness, we aim to increase the accuracy of the pruner in identifying non-relevant documents without early exiting documents that are likely to be ranked among the final top-k results. We propose an in-depth study of heuristic and machine-learning techniques for designing the pruner. While the heuristic technique only exploits the score/ranking information supplied by the first sub-optimal ranker, the machine-learned solution named LEAR uses these signals as additional features along with those representing query-document pairs. Moreover, we study alternative solutions to implement the first ranker, either a small prefix of the original forest or an auxiliary machine-learned ranker explicitly trained for this purpose. We evaluated our techniques through reproducible experiments using publicly available datasets and state-of-the-art competitors. The experiments confirm that our early-exit strategies achieve speedups ranging from 3× to 10× without statistically significant differences in effectiveness.
Source: IEEE Access 11 (2023): 126691–126704. doi:10.1109/ACCESS.2023.3331088
DOI: 10.1109/access.2023.3331088

See at: CNR ExploRA
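The three-stage cascade described in the abstract can be illustrated with a deliberately simple heuristic pruner. This is an illustrative sketch, not the authors' implementation: the fixed `keep` cutoff stands in for the learned early-exit decision, and all names are ours.

```python
import numpy as np

def cascade_rank(partial_scores, full_scorer, docs, k=10, keep=50):
    """Two-stage early-exit cascade (heuristic sketch).

    partial_scores: scores from the cheap first-stage ranker, one per doc
    full_scorer:    callable scoring the surviving docs with the full ensemble
    keep:           number of candidates that survive to the expensive stage
    """
    order = np.argsort(-np.asarray(partial_scores))
    survivors = [docs[i] for i in order[:keep]]          # early-exit the rest
    rescored = sorted(zip(survivors, full_scorer(survivors)),
                      key=lambda pair: -pair[1])
    return [doc for doc, _ in rescored[:k]]
```

The speedup comes from the fact that `full_scorer` (the large ensemble) only sees `keep` documents instead of the whole candidate set.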


2022 Journal article Open Access
Distilled neural networks for efficient learning to rank
Nardini F. M., Rulli C., Trani S., Venturini R.
Recent studies in Learning to Rank have shown the possibility to effectively distill a neural network from an ensemble of regression trees. This result leads neural networks to become a natural competitor of tree-based ensembles on the ranking task. Nevertheless, ensembles of regression trees outperform neural models both in terms of efficiency and effectiveness, particularly when scoring on CPU. In this paper, we propose an approach for speeding up neural scoring time by applying a combination of Distillation, Pruning and Fast Matrix multiplication. We employ knowledge distillation to learn shallow neural networks from an ensemble of regression trees. Then, we exploit an efficiency-oriented pruning technique that performs a sparsification of the most computationally intensive layers of the neural network, which is then scored with optimized sparse matrix multiplication. Moreover, by studying both dense and sparse high-performance matrix multiplication, we develop a scoring-time prediction model which helps in devising neural network architectures that match the desired efficiency requirements. Comprehensive experiments on two public learning-to-rank datasets show that neural networks produced with our novel approach are competitive at any point of the effectiveness-efficiency trade-off when compared with tree-based ensembles, providing up to 4× scoring-time speed-up without affecting the ranking quality.
Source: IEEE Transactions on Knowledge and Data Engineering (Online) 35 (2022): 4695–4712. doi:10.1109/TKDE.2022.3152585
DOI: 10.1109/tkde.2022.3152585

See at: ISTI Repository Open Access | ieeexplore.ieee.org Restricted | CNR ExploRA
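The efficiency-oriented sparsification step mentioned above amounts, in its simplest form, to magnitude pruning of the student network's layers. A minimal sketch (the 90% sparsity target and layer shape are illustrative, not the paper's configuration):

```python
import numpy as np

def magnitude_prune(w, sparsity=0.9):
    """Zero out the smallest-magnitude entries of a weight matrix, keeping
    roughly (1 - sparsity) of them for later sparse matrix multiplication."""
    thresh = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) >= thresh, w, 0.0)

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256))
sparse_w = magnitude_prune(w, sparsity=0.9)
# roughly 10% of the entries survive; scoring can then use sparse kernels
```

In practice the surviving entries would be stored in a compressed sparse format so that the scoring-time matrix product skips the zeros entirely.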


2022 Conference article Closed Access
Ensemble model compression for fast and energy-efficient ranking on FPGAs
Gil-Costa V., Loor F., Molina R., Nardini F. M., Perego R., Trani S.
We investigate novel SoC-FPGA solutions for fast and energy-efficient ranking based on machine-learned ensembles of decision trees. Since the memory footprint of ranking ensembles limits the effective exploitation of programmable logic for large-scale inference tasks, we investigate binning and quantization techniques to reduce the memory occupation of the learned model, and we optimize the state-of-the-art ensemble-traversal algorithm for deployment on low-cost, energy-efficient FPGA devices. The results of the experiments conducted using publicly available Learning-to-Rank datasets show that our model compression techniques do not significantly impact accuracy. Moreover, the reduced space requirements allow the models and the logic to be replicated on the FPGA device in order to execute several inference tasks in parallel. We discuss in detail the experimental settings and the feasibility of deploying the proposed solution in a real setting. The experiments show that our FPGA solution achieves state-of-the-art performance and consumes from 9× up to 19.8× less energy than an equivalent multi-threaded CPU implementation.
Source: ECIR 2022 - 44th European Conference on IR Research, pp. 260–273, Stavanger, Norway, 10-14/04/2022
DOI: 10.1007/978-3-030-99736-6_18

See at: doi.org Restricted | link.springer.com Restricted | CNR ExploRA
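The binning idea can be sketched as mapping float split thresholds to small integer bin ids computed from empirical feature quantiles, so each threshold fits in a single byte. An illustrative sketch (the 256-bin choice and function names are ours, not necessarily the paper's setting):

```python
import numpy as np

def bin_thresholds(thresholds, feature_values, n_bins=256):
    """Replace float split thresholds with small integer bin ids derived
    from the feature's empirical quantiles (one uint8 per tree node)."""
    edges = np.quantile(feature_values, np.linspace(0.0, 1.0, n_bins - 1))
    return np.searchsorted(edges, thresholds).astype(np.uint8)
```

Feature values are binned with the same edges at scoring time, so comparisons become cheap integer comparisons and the model shrinks from 4-8 bytes per threshold to one.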


2022 Conference article Open Access
A federated cloud solution for transnational mobility data sharing
Carlini E., Chevalier T., Dazzi P., Lettich F., Perego R., Renso C., Trani S.
Nowadays, innovative digital services are massively spreading in both the public and private sectors. In this work we focus on the digital data regarding the mobility of persons and goods, which are experiencing exponential growth thanks to the significant diffusion of telecommunication infrastructures and inexpensive GPS-equipped devices. The volume, velocity, and heterogeneity of mobility data call for advanced and efficient services to collect and integrate various data sources from different data producers. The MobiDataLab H2020 project aims to deal with these challenges by introducing an efficient and highly interoperable digital framework for mobility data sharing. In particular, the project aims to propose to the mobility stakeholders (i.e., transport organising authorities, operators, industry, governments, and innovators) reproducible methodologies and sustainable tools that can foster the development of a data-sharing culture in Europe and beyond. This paper introduces the key concepts driving the design and definition of a cloud-based data-sharing federation we call the Transport Cloud platform, which represents one of the main pillars of the MobiDataLab project. This platform aims to ensure transnational access to mobility data in a secure, efficient, and seamless way, and to ensure that FAIR principles (i.e., mobility data should be findable, accessible, interoperable, and reusable) are enforced.
Source: SEBD 2022 - 30th Italian Symposium on Advanced Database Systems, pp. 586–592, Tirrenia, Pisa, Italy, 19-22/06/2022
Project(s): ACCORDION via OpenAIRE, MobiDataLab via OpenAIRE

See at: ceur-ws.org Open Access | ISTI Repository Open Access | CNR ExploRA


2022 Contribution to conference Open Access
Energy-efficient ranking on FPGAs through ensemble model compression (Abstract)
Gil-Costa V., Loor F., Molina R., Nardini F. M., Perego R., Trani S.
In this talk, we present the main results of a paper accepted at ECIR 2022 [1]. We investigate novel SoC-FPGA solutions for fast and energy-efficient ranking based on machine-learned ensembles of decision trees. Since the memory footprint of ranking ensembles limits the effective exploitation of programmable logic for large-scale inference tasks [2], we investigate binning and quantization techniques to reduce the memory occupation of the learned model, and we optimize the state-of-the-art ensemble-traversal algorithm for deployment on low-cost, energy-efficient FPGA devices. The results of the experiments conducted using publicly available Learning-to-Rank datasets show that our model compression techniques do not significantly impact accuracy. Moreover, the reduced space requirements allow the models and the logic to be replicated on the FPGA device in order to execute several inference tasks in parallel. We discuss in detail the experimental settings and the feasibility of deploying the proposed solution in a real setting. The experiments show that our FPGA solution achieves state-of-the-art performance and consumes from 9× up to 19.8× less energy than an equivalent multi-threaded CPU implementation.
Source: IIR 2022 - 12th Italian Information Retrieval Workshop 2022, Tirrenia, Pisa, Italy, 19-22/06/2022

See at: ceur-ws.org Open Access | ISTI Repository Open Access | CNR ExploRA


2021 Journal article Closed Access
Efficient traversal of decision tree ensembles with FPGAs
Molina R., Loor F., Gil-Costa V., Nardini F. M., Perego R., Trani S.
System-on-Chip (SoC) based Field Programmable Gate Arrays (FPGAs) provide a hardware acceleration technology that can be rapidly deployed and tuned, thus providing a flexible solution adaptable to specific design requirements and to changing demands. In this paper, we present three SoC architecture designs for speeding up inference tasks based on machine-learned ensembles of decision trees. We focus on QuickScorer, the state-of-the-art algorithm for the efficient traversal of tree ensembles, and present the issues and the advantages related to its deployment on two SoC devices with different capacities. The results of the experiments conducted using publicly available datasets show that the solution proposed is very efficient and scalable. More importantly, it provides almost constant inference times, independently of the number of trees in the model and the number of instances to score. This allows the SoC solution deployed to be fine-tuned on the basis of the accuracy and latency constraints of the application scenario considered.
Source: Journal of Parallel and Distributed Computing (Print) 155 (2021): 38–49. doi:10.1016/j.jpdc.2021.04.008
DOI: 10.1016/j.jpdc.2021.04.008

See at: Journal of Parallel and Distributed Computing Restricted | Journal of Parallel and Distributed Computing Restricted | CNR ExploRA
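QuickScorer, the traversal algorithm the paper builds on, scores a document by AND-ing precomputed bitmasks of the tree nodes whose test fails; the leftmost surviving bit identifies the exit leaf. A simplified per-tree sketch (the real algorithm interleaves nodes across all trees by feature; the data layout here is illustrative):

```python
def quickscorer_sketch(false_masks, leaf_values, doc):
    """For each tree, AND the bitmasks of the nodes whose test x[f] <= t is
    FALSE for the document; the leftmost surviving bit is the exit leaf.

    false_masks: per tree, a list of (feature, threshold, mask) triples,
                 where mask keeps the leaves still reachable if the test fails
    leaf_values: per tree, leaf scores ordered left to right (leaf 0 = MSB)
    """
    score = 0.0
    for nodes, leaves in zip(false_masks, leaf_values):
        n = len(leaves)
        bitvec = (1 << n) - 1                     # all leaves reachable
        for feat, thresh, mask in nodes:
            if doc[feat] > thresh:                # node test is false
                bitvec &= mask
        exit_leaf = n - bitvec.bit_length()       # leftmost set bit
        score += leaves[exit_leaf]
    return score
```

The appeal for FPGAs is that the per-document work is a stream of AND operations on fixed-width bitvectors, with no branching on the tree structure.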


2021 Conference article Open Access
Learning early exit strategies for additive ranking ensembles
Busolin F., Lucchese C., Nardini F. M., Orlando S., Perego R., Trani S.
Modern search engine ranking pipelines are commonly based on large machine-learned ensembles of regression trees. We propose LEAR, a novel learned technique aimed at reducing the average number of trees traversed by documents to accumulate the scores, thus reducing the overall query response time. LEAR exploits a classifier that predicts whether a document can early exit the ensemble because it is unlikely to be ranked among the final top-k results. The early exit decision occurs at a sentinel point, i.e., after having evaluated a limited number of trees, and the partial scores are exploited to filter out non-promising documents. We evaluate LEAR by deploying it in a production-like setting, adopting a state-of-the-art algorithm for ensemble traversal. We provide a comprehensive experimental evaluation on two public datasets. The experiments show that LEAR has a significant impact on the efficiency of the query processing without hindering its ranking quality. In detail, on the first dataset LEAR achieves a speedup of 3× without any loss in NDCG@10, while on the second dataset the speedup is larger than 5× with a negligible NDCG@10 loss (< 0.05%).
Source: SIGIR '21 - The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2217–2221, Online conference, 11-15/07/2021
DOI: 10.1145/3404835.3463088

See at: arXiv.org e-Print Archive Open Access | Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari Open Access | dl.acm.org Restricted | dl.acm.org Restricted | CNR ExploRA


2020 Conference article Open Access
Query-level early exit for additive learning-to-rank ensembles
Lucchese C., Nardini F. M., Orlando S., Perego R., Trani S.
Search engine ranking pipelines are commonly based on large ensembles of machine-learned decision trees. The tight constraints on query response time recently motivated researchers to investigate algorithms to speed up the traversal of the additive ensemble or to terminate early the evaluation of documents that are unlikely to be ranked among the top-k. In this paper, we investigate the novel problem of query-level early exiting, aimed at deciding the profitability of early stopping the traversal of the ranking ensemble for all the candidate documents to be scored for a query, by simply returning a ranking based on the additive scores computed by a limited portion of the ensemble. Besides the obvious advantage on query latency and throughput, we address the possible positive impact on ranking effectiveness. To this end, we study the actual contribution of incremental portions of the tree ensemble to the ranking of the top-k documents scored for a given query. Our main finding is that queries exhibit different behaviors as scores are accumulated during the traversal of the ensemble and that query-level early stopping can remarkably improve ranking quality. We present a reproducible and comprehensive experimental evaluation, conducted on two public datasets, showing that query-level early exiting achieves an overall gain of up to 7.5% in terms of NDCG@10 with a speedup of the scoring process of up to 2.2×.
Source: 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2033–2036, Online Conference, 25-30 July 2020
DOI: 10.1145/3397271.3401256
DOI: 10.48550/arxiv.2004.14641
Project(s): BigDataGrapes via OpenAIRE

See at: arXiv.org e-Print Archive Open Access | arxiv.org Open Access | Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari Open Access | ISTI Repository Open Access | doi.org Restricted | doi.org Restricted | CNR ExploRA
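Query-level early exiting can be mimicked with a simple stability rule: stop traversing the ensemble for all of a query's candidates once the top-k set no longer changes between checkpoints. This is an illustrative heuristic of ours, not the criterion studied in the paper:

```python
import numpy as np

def query_level_early_exit(tree_scores, k=10, checkpoints=(100, 200, 400)):
    """Traverse the additive ensemble in chunks and stop, for ALL candidate
    documents of the query, once the top-k set is stable across checkpoints.

    tree_scores: per-tree lists of score contributions, one entry per document.
    Returns the top-k document indices and the number of trees actually used.
    """
    acc = np.zeros(len(tree_scores[0]))
    prev_topk, used = None, 0
    for t, contrib in enumerate(tree_scores, start=1):
        acc += contrib
        used = t
        if t in checkpoints:
            topk = set(np.argsort(-acc)[:k])
            if topk == prev_topk:
                break                      # early exit for the whole query
            prev_topk = topk
    return np.argsort(-acc)[:k], used
```

Because the decision is taken once per query rather than once per document, a stable query pays for only a prefix of the forest.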


2020 Journal article Open Access
RankEval: Evaluation and investigation of ranking models
Lucchese C., Muntean C. I., Nardini F. M., Perego R., Trani S.
RankEval is a Python open-source tool for the analysis and evaluation of ranking models based on ensembles of decision trees. Learning-to-Rank (LtR) approaches that generate tree ensembles are considered the most effective solution for difficult ranking tasks, and several impactful LtR libraries have been developed aimed at improving ranking quality and training efficiency. However, these libraries are not very helpful in terms of hyper-parameter tuning and in-depth analysis of the learned models, and even the implementations of the most popular Information Retrieval (IR) metrics differ among them, thus making it difficult to compare different models. RankEval overcomes these limitations by providing a unified environment in which to perform an easy, comprehensive inspection and assessment of ranking models trained using different machine learning libraries. The tool focuses on ensuring efficiency, flexibility and extensibility and is fully interoperable with the most popular LtR libraries.
Source: SoftwareX (Amsterdam) 12 (2020). doi:10.1016/j.softx.2020.100614
DOI: 10.1016/j.softx.2020.100614
Project(s): BigDataGrapes via OpenAIRE

See at: SoftwareX Open Access | ISTI Repository Open Access | SoftwareX Open Access | www.sciencedirect.com Open Access | CNR ExploRA


2018 Journal article Open Access
X-CLEaVER: Learning ranking ensembles by growing and pruning trees
Lucchese C., Nardini F. M., Orlando S., Perego R., Silvestri F., Trani S.
Learning-to-Rank (LtR) solutions are commonly used in large-scale information retrieval systems such as Web search engines, which have to return highly relevant documents in response to a user query within fractions of a second. The most effective LtR algorithms adopt a gradient boosting approach to build additive ensembles of weighted regression trees. Since the required ranking effectiveness is achieved with very large ensembles, the impact on response time and query throughput of these solutions is not negligible. In this article, we propose X-CLEaVER, an iterative meta-algorithm able to build more efficient and effective ranking ensembles. X-CLEaVER interleaves the iterations of a given gradient boosting learning algorithm with pruning and re-weighting phases. First, redundant trees are removed from the given ensemble, then the weights of the remaining trees are fine-tuned by optimizing the desired ranking quality metric. We propose and analyze several pruning strategies and we assess their benefits, showing that interleaving pruning and re-weighting phases during learning is more effective than applying a single post-learning optimization step. Experiments conducted using two publicly available LtR datasets show that X-CLEaVER can be successfully exploited on top of several LtR algorithms, as it is effective in optimizing the effectiveness of the learnt ensembles, thus obtaining more compact forests that are much more efficient at scoring time.
Source: ACM Transactions on Intelligent Systems and Technology (Print) 9 (2018). doi:10.1145/3205453
DOI: 10.1145/3205453
DOI: 10.5281/zenodo.2668362
DOI: 10.5281/zenodo.2668361
Project(s): BigDataGrapes via OpenAIRE, SoBigData via OpenAIRE

See at: ZENODO Open Access | ZENODO Open Access | Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari Open Access | ISTI Repository Open Access | zenodo.org Open Access | dl.acm.org Restricted | ACM Transactions on Intelligent Systems and Technology Restricted | CNR ExploRA


2018 Conference article Open Access
Selective gradient boosting for effective learning to rank
Lucchese C., Nardini F. M., Perego R., Orlando S., Trani S.
Learning an effective ranking function from a large number of query-document examples is a challenging task. Indeed, training sets where queries are associated with a few relevant documents and a large number of irrelevant ones are required to model real scenarios of Web search production systems, where a query can possibly retrieve thousands of matching documents, but only a few of them are actually relevant. In this paper, we propose Selective Gradient Boosting (SelGB), an algorithm addressing the Learning-to-Rank task by focusing on those irrelevant documents that are most likely to be mis-ranked, thus severely hindering the quality of the learned model. SelGB exploits a novel technique minimizing the mis-ranking risk, i.e., the probability that two randomly drawn instances are ranked incorrectly, within a gradient boosting process that iteratively generates an additive ensemble of decision trees. Specifically, at every iteration and on a per query basis, SelGB selectively chooses among the training instances a small sample of negative examples enhancing the discriminative power of the learned model. Reproducible and comprehensive experiments conducted on a publicly available dataset show that SelGB exploits the diversity and variety of the negative examples selected to train tree ensembles that outperform models generated by state-of-the-art algorithms by achieving improvements of NDCG@10 up to 3.2%.
Source: International ACM Conference on Research and Development in Information Retrieval (SIGIR), pp. 155–164, 8-12/07/2018
DOI: 10.1145/3209978.3210048
DOI: 10.5281/zenodo.2668014
DOI: 10.5281/zenodo.2668013
Project(s): BigDataGrapes via OpenAIRE, MASTER via OpenAIRE, SoBigData via OpenAIRE

See at: ZENODO Open Access | ZENODO Open Access | Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari Open Access | ISTI Repository Open Access | zenodo.org Open Access | dl.acm.org Restricted | doi.org Restricted | CNR ExploRA
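SelGB's per-query selection step can be sketched as keeping every relevant document plus only the highest-scoring irrelevant ones under the current model, i.e. the negatives most likely to be mis-ranked. An illustrative sketch (function and parameter names are ours):

```python
import numpy as np

def select_negatives(scores, labels, n_neg=5):
    """Selective sampling for one query: keep every relevant document and
    only the n_neg highest-scoring irrelevant ones under the current model."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    pos = np.flatnonzero(labels > 0)
    neg = np.flatnonzero(labels == 0)
    hard = neg[np.argsort(-scores[neg])[:n_neg]]      # hardest negatives
    return np.sort(np.concatenate([pos, hard]))
```

In the full algorithm this selection is re-done at every boosting iteration, so the set of "hard" negatives evolves with the model.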


2018 Journal article Open Access
SEL: a unified algorithm for salient entity linking
Trani S., Lucchese C., Perego R., Losada D. E., Ceccarelli D., Orlando S.
The entity linking task consists in automatically identifying and linking the entities mentioned in a text to their uniform resource identifiers in a given knowledge base. This task is very challenging due to its natural language ambiguity. However, not all the entities mentioned in the document have the same utility in understanding the topics being discussed. Thus, the related problem of identifying the most relevant entities present in the document, also known as salient entities (SE), is attracting increasing interest. In this paper, we propose salient entity linking, a novel supervised 2-step algorithm comprehensively addressing both entity linking and saliency detection. The first step is aimed at identifying a set of candidate entities that are likely to be mentioned in the document. The second step, besides detecting linked entities, also scores them according to their saliency. Experiments conducted on 2 different data sets show that the proposed algorithm outperforms state-of-the-art competitors and is able to detect SE with high accuracy. Furthermore, we used salient entity linking for extractive text summarization. We found that entity saliency can be incorporated into text summarizers to extract salient sentences from text. The resulting summarizers outperform well-known summarization systems, proving the importance of using the SE information.
Source: Computational Intelligence 34 (2018): 2–29. doi:10.1111/coin.12147
DOI: 10.1111/coin.12147
Project(s): SoBigData via OpenAIRE

See at: ISTI Repository Open Access | Computational Intelligence Restricted | onlinelibrary.wiley.com Restricted | CNR ExploRA


2017 Conference article Restricted
RankEval: an evaluation and analysis framework for learning-to-rank solutions
Lucchese C., Muntean C. I., Nardini F. M., Perego R., Trani S.
In this demo paper we propose RankEval, an open-source tool for the analysis and evaluation of Learning-to-Rank (LtR) models based on ensembles of regression trees. Gradient Boosted Regression Trees (GBRT) is a flexible statistical learning technique for classification and regression at the state of the art for training effective LtR solutions. Indeed, the success of GBRT fostered the development of several open-source LtR libraries targeting efficiency of the learning phase and effectiveness of the resulting models. However, these libraries offer only very limited help for the tuning and evaluation of the trained models. In addition, the implementations provided for even the most traditional IR evaluation metrics differ from library to library, thus making the objective evaluation and comparison between trained models a difficult task. RankEval addresses these issues by providing a common ground for LtR libraries that offers useful and interoperable tools for a comprehensive comparison and in-depth analysis of ranking models.
Source: SIGIR '17 - 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1281–1284, Tokyo, Japan, 9-11 August 2017
DOI: 10.1145/3077136.3084140
Project(s): SoBigData via OpenAIRE

See at: dl.acm.org Restricted | doi.org Restricted | CNR ExploRA


2017 Conference article Restricted
X-DART: blending dropout and pruning for efficient learning to rank
Lucchese C., Nardini F. M., Orlando S., Perego R., Trani S.
In this paper we propose X-DART, a new Learning to Rank algorithm focusing on the training of robust and compact ranking models. Motivated by the observation that the last trees of MART models impact the prediction of only a few instances of the training set, we borrow from the DART algorithm the dropout strategy, consisting in temporarily dropping some of the trees from the ensemble while new weak learners are trained. However, differently from DART, we permanently drop these trees on the basis of smart choices driven by accuracy measured on the validation set. Experiments conducted on publicly available datasets show that X-DART outperforms DART, training models that provide the same effectiveness while employing up to 40% fewer trees.
Source: SIGIR '17 - 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1077–1080, Tokyo, Japan, 9-11 August 2017
DOI: 10.1145/3077136.3080725

See at: doi.acm.org Restricted | doi.org Restricted | CNR ExploRA


2017 Conference article Open Access
The impact of negative samples on learning to rank
Lucchese C., Nardini F. M., Perego R., Trani S.
Learning-to-Rank (LtR) techniques leverage machine learning algorithms and large amounts of training data to induce high-quality ranking functions. Given a set of documents and a user query, these functions are able to predict a score for each of the documents that is in turn exploited to induce a relevance ranking. The effectiveness of these learned functions has been proved to be significantly affected by the data used to learn them. Several analysis and document selection strategies have been proposed in the past to deal with this aspect. In this paper we review the state-of-the-art proposals and we report the results of a preliminary investigation of a new sampling strategy aimed at reducing the number of non-relevant query-document pairs, so as to significantly decrease the training time of the learning algorithm and increase the final effectiveness of the model by reducing noise and redundancy in the training set.
Source: 1st International Workshop on LEARning Next gEneration Rankers, LEARNER 2017, Amsterdam, Netherlands, 1 October 2017

See at: ceur-ws.org Open Access | ISTI Repository Open Access | CNR ExploRA


2017 Doctoral thesis Open Access
Improving the Efficiency and Effectiveness of Document Understanding in Web Search
Trani S.
Web Search Engines (WSEs) are nowadays probably the most complex information systems, since they need to handle an ever-increasing amount of web pages and match them with the information needs expressed in short and often ambiguous queries by a multitude of heterogeneous users. In addressing this challenging task they have to deal, at an unprecedented scale, with two classic and contrasting IR problems: the satisfaction of effectiveness requirements and of efficiency constraints. While the former refers to the user-perceived quality of query results, the latter regards the time spent by the system in retrieving and presenting them to the user. Due to the importance of text data on the Web, natural language understanding techniques have acquired popularity in recent years and are profitably exploited by WSEs to overcome ambiguities in natural language queries given, for example, by polysemy and synonymy. A promising approach in this direction is represented by the so-called Web of Data, a paradigm shift which originates from the Semantic Web and promotes the enrichment of Web documents with the semantic concepts they refer to. Enriching unstructured text with an entity-based representation of documents - where entities can precisely identify persons, companies, locations, etc. - allows, in fact, a remarkable improvement of retrieval effectiveness to be achieved. In this thesis, we argue that it is possible to improve both the efficiency and the effectiveness of document understanding in Web search by exploiting learning-to-rank, i.e., a supervised technique aimed at learning effective ranking functions from training data. Indeed, on the one hand, enriching documents with machine-learnt semantic annotations leads to an improvement of WSE effectiveness, since the retrieval of relevant documents can exploit a finer comprehension of the documents. On the other hand, by enhancing the efficiency of learning-to-rank techniques we can improve both WSE efficiency and effectiveness, since a faster ranking technique can reduce query processing time or, alternatively, allow a more complex and accurate ranking model to be deployed. The contributions of this thesis are manifold: i) we discuss a novel machine-learnt measure for estimating the relatedness among entities mentioned in a document, thus enhancing the accuracy of text disambiguation techniques for document understanding; ii) we propose a novel machine-learnt technique to label the mentioned entities according to a notion of saliency, where the most salient entities are those that have the highest utility in understanding the topics discussed; iii) we enhance state-of-the-art ensemble-based ranking models by means of a general learning-to-rank framework that is able to iteratively prune the less useful part of the ensemble and re-weight the remaining part according to the loss function adopted. Finally, we share with the research community working in this area several open-source tools to promote collaborative development and favor the reproducibility of research results.

See at: etd.adm.unipi.it Open Access | ISTI Repository Open Access | CNR ExploRA


2016 Conference article Restricted
Post-learning optimization of tree ensembles for efficient ranking
Lucchese C., Perego R., Nardini F. M., Silvestri F., Orlando S., Trani S.
Learning to Rank (LtR) is the machine learning method of choice for producing high quality document ranking functions from a ground-truth of training examples. In practice, efficiency and effectiveness are intertwined concepts and trading off effectiveness for meeting efficiency constraints typically existing in large-scale systems is one of the most urgent issues. In this paper we propose a new framework, named CLEaVER, for optimizing machine-learned ranking models based on ensembles of regression trees. The goal is to improve efficiency at document scoring time without affecting quality. Since the cost of an ensemble is linear in its size, CLEaVER first removes a subset of the trees in the ensemble, and then fine-tunes the weights of the remaining trees according to any given quality measure. Experiments conducted on two publicly available LtR datasets show that CLEaVER is able to prune up to 80% of the trees and provides an efficiency speed-up of up to 2.6× without affecting the effectiveness of the model.
Source: 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 949–952, Pisa, Italy, 17-21 July 2016
DOI: 10.1145/2911451.2914763
Project(s): SoBigData via OpenAIRE

See at: dl.acm.org Restricted | doi.org Restricted | CNR ExploRA
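The two phases described above, pruning a subset of trees and then re-fitting the weights of the survivors, can be sketched as follows. The correlation-based selection rule and least-squares re-weighting are illustrative stand-ins for the pruning strategies and quality-measure optimization studied in the paper:

```python
import numpy as np

def cleaver_sketch(tree_preds, targets, keep_frac=0.2):
    """Post-learning optimization sketch: drop most trees, then re-fit the
    weights of the survivors by least squares on validation data.

    tree_preds: (n_docs, n_trees) matrix of per-tree predictions.
    """
    n_trees = tree_preds.shape[1]
    # toy selection rule: keep the trees most correlated with the target
    corr = np.abs([np.corrcoef(tree_preds[:, j], targets)[0, 1]
                   for j in range(n_trees)])
    kept = np.argsort(-corr)[:max(1, int(keep_frac * n_trees))]
    weights, *_ = np.linalg.lstsq(tree_preds[:, kept], targets, rcond=None)
    return kept, weights
```

Because scoring cost is linear in the number of trees, keeping 20% of them gives roughly a 5× reduction in scoring work; the re-weighting step then compensates for the removed contributions.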


2016 Conference article Restricted
SEL: A unified algorithm for entity linking and saliency detection
Trani S., Ceccarelli D., Lucchese C., Orlando S., Perego R.
The Entity Linking task consists in automatically identifying and linking the entities mentioned in a text to their URIs in a given Knowledge Base, e.g., Wikipedia. Entity Linking has a large impact in several text analysis and information retrieval related tasks. This task is very challenging due to natural language ambiguity. However, not all the entities mentioned in a document have the same relevance and utility in understanding the topics being discussed. Thus, the related problem of identifying the most relevant entities present in a document, also known as Salient Entities, is attracting increasing interest. In this paper we propose SEL, a novel supervised two-step algorithm comprehensively addressing both entity linking and saliency detection. The first step is based on a classifier aimed at identifying a set of candidate entities that are likely to be mentioned in the document, thus maximizing the precision of the method without hindering its recall. The second step is still based on machine learning, and aims at choosing from the previous set the entities that actually occur in the document. Indeed, we tested two different versions of the second step, one aimed at solving only the entity linking task, and the other that, besides detecting linked entities, also scores them according to their saliency. Experiments conducted on two different datasets show that the proposed algorithm outperforms state-of-the-art competitors, and is able to detect salient entities with high accuracy.
Source: ACM Symposium on Document Engineering, pp. 85–94, Vienna, Austria, 13-16 September 2016
DOI: 10.1145/2960811.2960819
Project(s): SoBigData via OpenAIRE

See at: dl.acm.org Restricted | doi.org Restricted | CNR ExploRA


2016 Contribution to conference Restricted
Improve ranking efficiency by optimizing tree ensembles
Lucchese C., Nardini F. M., Orlando S., Perego R., Silvestri F., Trani S.
Learning to Rank (LtR) is the machine learning method of choice for producing highly effective ranking functions. However, efficiency and effectiveness are two competing forces, and trading off effectiveness for meeting efficiency constraints typical of production systems is one of the most urgent issues. This extended abstract briefly summarizes the work in [4] proposing CLEaVER, a new framework for optimizing LtR models based on ensembles of regression trees. We summarize the results of a comprehensive evaluation showing that CLEaVER is able to prune up to 80% of the trees and provides an efficiency speed-up of up to 2.6× without affecting the effectiveness of the model.
Source: 7th Italian Information Retrieval Workshop, Venezia, Italia, 30-31 May 2016

See at: ceur-ws.org Restricted | CNR ExploRA


2015 Conference article Open Access
Entity linking on philosophical documents
Trani S., Ceccarelli D., De Francesco A., Perego R., Segala M., Tonellotto N.
Entity Linking consists in automatically enriching a document by detecting the text fragments mentioning a given entity in an external knowledge base, e.g., Wikipedia. This problem is a hot research topic due to its impact on several text-understanding related tasks. However, its application to specific, restricted topic domains has not received much attention. In this work we study how we can improve entity linking performance by exploiting a domain-oriented knowledge base, obtained by filtering out from Wikipedia the entities that are not relevant for the target domain. We focus on the philosophical domain, and we experiment with a combination of three different entity filtering approaches: one based on the "Philosophy" category of Wikipedia, and two based on similarity metrics between philosophical documents and the textual description of the entities in the knowledge base, namely cosine similarity and Kullback-Leibler divergence. We apply traditional entity linking strategies to the domain-oriented knowledge base obtained with these filtering techniques. Finally, we use the resulting enriched documents to conduct a preliminary user study with an expert in the area.
Source: Italian Information Retrieval Workshop, pp. 12–12, Cagliari, Italy, 25-26/05/2015

See at: ceur-ws.org Open Access | CNR ExploRA